Adaptation de la matrice de covariance pour l'apprentissage par renforcement direct
Internal identifier: 001694 (Main/Exploration); previous: 001693; next: 001695
Authors: Olivier Sigaud [France]; Freek Stulp [France]
Source:
- Revue d'intelligence artificielle [ 0992-499X ] ; 2013.
French descriptors
- Pascal (Inist)
- Adaptation, Intelligence artificielle, Boîte noire, Apprentissage renforcé, Estimation statistique, Mise à jour, Robotique, Adressage, Politique, Commande stochastique, Commande optimale, Contrôle optimal, Matrice covariance, Modélisation, Optimisation, Approche probabiliste, Méthode moyenne, Analyse statistique, Fonction coût, Entropie, Méthode matricielle, Algorithme évolutionniste, Intégrale parcours, Variance.
- Wicri :
- topic : Intelligence artificielle, Robotique, Politique.
English descriptors
- KwdEn :
- Adaptation, Addressing, Artificial intelligence, Averaging method, Black box, Cost function, Covariance matrix, Entropy, Evolutionary algorithm, Matrix method, Modeling, Optimal control, Optimal control (mathematics), Optimization, Path integral, Policy, Probabilistic approach, Reinforcement learning, Robotics, Statistical analysis, Statistical estimation, Stochastic control, Updating, Variance.
Abstract
There has been a recent focus in reinforcement learning on addressing continuous state and action problems by optimizing parameterized policies. PI² is a recent example of this approach. It combines a derivation from first principles of stochastic optimal control with tools from statistical estimation theory. In this paper, we consider PI² as a member of the wider family of methods which share the concept of probability-weighted averaging to iteratively update parameters to optimize a cost function. We compare PI² to other members of the same family - the 'Cross-Entropy Method' and 'Covariance Matrix Adaptation - Evolutionary Strategy' - at the conceptual level and in terms of performance. The comparison suggests the derivation of a novel algorithm which we call PI²-CMA for "Path Integral Policy Improvement with Covariance Matrix Adaptation". PI²-CMA's main advantage is that it determines the magnitude of the exploration noise automatically. We illustrate this advantage on a non-trivial simulated robotics experiment.
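The probability-weighted averaging idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the algorithm from the paper: the function names `weighted_update` and `optimize`, the exponential weighting constant `h`, and the use of a diagonal (per-dimension) variance instead of a full covariance matrix are all assumptions made for illustration. Each update samples parameters around the current mean, gives low-cost samples a high weight, and recomputes both the mean and the exploration variance from the weighted samples.

```python
import math
import random


def weighted_update(mean, samples, costs, h=10.0):
    """One probability-weighted averaging step (illustrative sketch).

    mean:    current parameter vector the samples were drawn around
    samples: list of sampled parameter vectors
    costs:   one scalar cost per sample (lower is better)
    h:       assumed constant controlling how sharply low costs are favored
    """
    lo, hi = min(costs), max(costs)
    span = (hi - lo) or 1.0  # fall back to 1.0 when all costs are equal
    # Map costs to weights: low cost gives high weight; weights sum to 1.
    raw = [math.exp(-h * (c - lo) / span) for c in costs]
    total = sum(raw)
    weights = [r / total for r in raw]

    dim = len(mean)
    # New mean: weighted average of the sampled parameters.
    new_mean = [sum(w * s[d] for w, s in zip(weights, samples))
                for d in range(dim)]
    # New exploration variance (diagonal only, a simplification):
    # weighted squared deviation of the samples from the previous mean.
    new_var = [sum(w * (s[d] - mean[d]) ** 2 for w, s in zip(weights, samples))
               for d in range(dim)]
    return new_mean, new_var, weights


def optimize(cost, mean, variances, n_samples=20, n_updates=50, seed=0):
    """Repeat sampling and weighted updates; returns (mean, variances)."""
    rng = random.Random(seed)
    for _ in range(n_updates):
        samples = [[rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, variances)]
                   for _ in range(n_samples)]
        costs = [cost(s) for s in samples]
        mean, variances, _ = weighted_update(mean, samples, costs)
    return mean, variances
```

As a usage example, `optimize(lambda p: (p[0] - 3.0) ** 2 + (p[1] + 2.0) ** 2, [0.0, 0.0], [1.0, 1.0])` runs the loop on a hypothetical quadratic cost. Because the variance is re-estimated from the weighted samples at every update, the magnitude of the exploration noise is adjusted by the procedure itself rather than fixed by hand.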
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000061
- to stream PascalFrancis, to step Curation: 000946
- to stream PascalFrancis, to step Checkpoint: 000040
- to stream Main, to step Merge: 001711
- to stream Main, to step Curation: 001694
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="fr" level="a">Adaptation de la matrice de covariance pour l'apprentissage par renforcement direct</title>
<author><name sortKey="Sigaud, Olivier" sort="Sigaud, Olivier" uniqKey="Sigaud O" first="Olivier" last="Sigaud">Olivier Sigaud</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Institut des Systèmes Intelligents et de Robotique Université Pierre et Marie Curie CNRS UMR 7222 4, place Jussieu</s1>
<s2>75252 Paris</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Stulp, Freek" sort="Stulp, Freek" uniqKey="Stulp F" first="Freek" last="Stulp">Freek Stulp</name>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>Cognitive Robotics École Nationale Supérieure de Techniques Avancées (ENSTA-ParisTech) 32, Boulevard Victor</s1>
<s2>75015 Paris</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="03"><s1>FLOWERS Research Team INRIA Bordeaux Sud-Ouest 351, Cours de la Libération</s1>
<s2>33405 Talence</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Nouvelle-Aquitaine</region>
<region type="old region" nuts="2">Aquitaine</region>
<settlement type="city">Talence</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">13-0216767</idno>
<date when="2013">2013</date>
<idno type="stanalyst">PASCAL 13-0216767 INIST</idno>
<idno type="RBID">Pascal:13-0216767</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000061</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000946</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000040</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000040</idno>
<idno type="wicri:doubleKey">0992-499X:2013:Sigaud O:adaptation:de:la</idno>
<idno type="wicri:Area/Main/Merge">001711</idno>
<idno type="wicri:Area/Main/Curation">001694</idno>
<idno type="wicri:Area/Main/Exploration">001694</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="fr" level="a">Adaptation de la matrice de covariance pour l'apprentissage par renforcement direct</title>
<author><name sortKey="Sigaud, Olivier" sort="Sigaud, Olivier" uniqKey="Sigaud O" first="Olivier" last="Sigaud">Olivier Sigaud</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Institut des Systèmes Intelligents et de Robotique Université Pierre et Marie Curie CNRS UMR 7222 4, place Jussieu</s1>
<s2>75252 Paris</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Stulp, Freek" sort="Stulp, Freek" uniqKey="Stulp F" first="Freek" last="Stulp">Freek Stulp</name>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>Cognitive Robotics École Nationale Supérieure de Techniques Avancées (ENSTA-ParisTech) 32, Boulevard Victor</s1>
<s2>75015 Paris</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="03"><s1>FLOWERS Research Team INRIA Bordeaux Sud-Ouest 351, Cours de la Libération</s1>
<s2>33405 Talence</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Nouvelle-Aquitaine</region>
<region type="old region" nuts="2">Aquitaine</region>
<settlement type="city">Talence</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Revue d'intelligence artificielle</title>
<title level="j" type="abbreviated">Rev. intell. artif.</title>
<idno type="ISSN">0992-499X</idno>
<imprint><date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Revue d'intelligence artificielle</title>
<title level="j" type="abbreviated">Rev. intell. artif.</title>
<idno type="ISSN">0992-499X</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Adaptation</term>
<term>Addressing</term>
<term>Artificial intelligence</term>
<term>Averaging method</term>
<term>Black box</term>
<term>Cost function</term>
<term>Covariance matrix</term>
<term>Entropy</term>
<term>Evolutionary algorithm</term>
<term>Matrix method</term>
<term>Modeling</term>
<term>Optimal control</term>
<term>Optimal control (mathematics)</term>
<term>Optimization</term>
<term>Path integral</term>
<term>Policy</term>
<term>Probabilistic approach</term>
<term>Reinforcement learning</term>
<term>Robotics</term>
<term>Statistical analysis</term>
<term>Statistical estimation</term>
<term>Stochastic control</term>
<term>Updating</term>
<term>Variance</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Adaptation</term>
<term>Intelligence artificielle</term>
<term>Boîte noire</term>
<term>Apprentissage renforcé</term>
<term>Estimation statistique</term>
<term>Mise à jour</term>
<term>Robotique</term>
<term>Adressage</term>
<term>Politique</term>
<term>Commande stochastique</term>
<term>Commande optimale</term>
<term>Contrôle optimal</term>
<term>Matrice covariance</term>
<term>Modélisation</term>
<term>Optimisation</term>
<term>Approche probabiliste</term>
<term>Méthode moyenne</term>
<term>Analyse statistique</term>
<term>Fonction coût</term>
<term>Entropie</term>
<term>Méthode matricielle</term>
<term>Algorithme évolutionniste</term>
<term>Intégrale parcours</term>
<term>Variance</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Intelligence artificielle</term>
<term>Robotique</term>
<term>Politique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">There has been a recent focus in reinforcement learning on addressing continuous state and action problems by optimizing parameterized policies. PI<sup>2</sup>
is a recent example of this approach. It combines a derivation from first principles of stochastic optimal control with tools from statistical estimation theory. In this paper, we consider PI<sup>2</sup>
as a member of the wider family of methods which share the concept of probability-weighted averaging to iteratively update parameters to optimize a cost function. We compare PI<sup>2</sup>
to other members of the same family - the 'Cross-Entropy Method' and 'Covariance Matrix Adaptation - Evolutionary Strategy' - at the conceptual level and in terms of performance. The comparison suggests the derivation of a novel algorithm which we call PI<sup>2</sup>
-CMA for "Path Integral Policy Improvement with Covariance Matrix Adaptation". PI<sup>2</sup>
-CMA's main advantage is that it determines the magnitude of the exploration noise automatically. We illustrate this advantage on a non-trivial simulated robotics experiment.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Aquitaine</li>
<li>Nouvelle-Aquitaine</li>
<li>Île-de-France</li>
</region>
<settlement><li>Paris</li>
<li>Talence</li>
</settlement>
</list>
<tree><country name="France"><region name="Île-de-France"><name sortKey="Sigaud, Olivier" sort="Sigaud, Olivier" uniqKey="Sigaud O" first="Olivier" last="Sigaud">Olivier Sigaud</name>
</region>
<name sortKey="Stulp, Freek" sort="Stulp, Freek" uniqKey="Stulp F" first="Freek" last="Stulp">Freek Stulp</name>
<name sortKey="Stulp, Freek" sort="Stulp, Freek" uniqKey="Stulp F" first="Freek" last="Stulp">Freek Stulp</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001694 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001694 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Lorraine |area= InforLorV4 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:13-0216767 |texte= Adaptation de la matrice de covariance pour l'apprentissage par renforcement direct }}
This area was generated with Dilib version V0.6.33.